Staleness-Aware Async-SGD for Distributed Deep Learning
Authors
Abstract
This paper investigates the effect of stale (delayed) gradient updates in asynchronous stochastic gradient descent (Async-SGD) optimization for distributed training of deep neural networks. We demonstrate that our implementation of Async-SGD on an HPC cluster can achieve a tight bound on gradient staleness while providing near-linear speedup. We propose a variant of the SGD algorithm in which the learning rate is modulated according to the gradient staleness, and we provide theoretical guarantees for the convergence of this algorithm. Experimental verification is performed on commonly used image classification benchmarks, CIFAR10 and ImageNet, to demonstrate the effectiveness of the proposed approach. Additionally, our experiments show that there is a fundamental tradeoff between model accuracy and runtime performance that limits the maximum amount of parallelism that can be extracted from this workload while preserving model quality.
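The abstract does not spell out the modulation rule, so the snippet below is only a minimal sketch of the idea, under the assumption that the base learning rate is scaled down inversely with the measured staleness; the function and parameter names (`staleness_aware_update`, `base_lr`, `staleness`) are illustrative, not taken from the paper.

```python
import numpy as np

def staleness_aware_update(params, grad, base_lr, staleness):
    """Apply one asynchronous SGD update, damping the step by the gradient's staleness.

    params    : current model parameters (np.ndarray)
    grad      : gradient computed by a worker on a possibly stale parameter copy
    staleness : number of parameter updates applied since that copy was read (tau >= 0)

    Assumption for illustration: eta = base_lr / max(1, tau), i.e. staler gradients
    take proportionally smaller steps.
    """
    lr = base_lr / max(1.0, float(staleness))
    return params - lr * grad

# Example: a gradient delayed by 4 updates is applied with a quarter of the base step.
w = np.zeros(10)
g = np.ones(10)
w = staleness_aware_update(w, g, base_lr=0.1, staleness=4)
```

Staleness here is the usual bookkeeping quantity on a parameter server: the difference between the current update counter and the counter at the time the worker pulled its copy of the weights.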
Similar papers
YellowFin and the Art of Momentum Tuning
Hyperparameter tuning is a major cost of deep learning, and momentum is a key hyperparameter for SGD and its variants; adaptive methods such as Adam do not tune momentum. The YellowFin optimizer builds on the robustness properties of momentum, auto-tunes both the momentum and the learning rate in SGD, and adds closed-loop momentum control for asynchronous training. In experiments on ResNet and LSTM, YellowFin runs w...
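For context, here is a minimal sketch of the heavy-ball momentum SGD update whose two hyperparameters (learning rate and momentum) YellowFin is described as tuning automatically; the tuning rule itself is not reproduced, and the fixed values and function name below are illustrative assumptions.

```python
import numpy as np

def momentum_sgd_step(params, velocity, grad, lr=0.01, momentum=0.9):
    """One heavy-ball momentum SGD step.

    `lr` and `momentum` are the two hyperparameters that YellowFin is described
    as tuning automatically; the fixed values here are for illustration only.
    """
    velocity = momentum * velocity - lr * grad  # decaying accumulation of past gradients
    params = params + velocity
    return params, velocity

w = np.zeros(5)
v = np.zeros(5)
w, v = momentum_sgd_step(w, v, grad=np.ones(5))
```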
Asynchronous Byzantine Machine Learning
Asynchronous distributed machine learning solutions have proven very effective so far, but they always assume perfectly functioning workers. In practice, however, some workers can exhibit Byzantine behavior, caused by hardware failures, software bugs, corrupt data, or even malicious attacks. We introduce Kardam, the first distributed asynchronous stochastic gradient descent (SGD) algorithm t...
Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD
Distributed Stochastic Gradient Descent (SGD), when run in a synchronous manner, suffers from delays while waiting for the slowest learners (stragglers). Asynchronous methods can alleviate stragglers, but they cause gradient staleness that can adversely affect convergence. In this work we present the first theoretical characterization of the speed-up offered by asynchronous methods by analyzing the trad...
How to scale distributed deep learning?
Training time on large datasets for deep neural networks is the principal workflow bottleneck in a number of important applications of deep learning, such as object classification and detection in automatic driver assistance systems (ADAS). To minimize training time, the training of a deep neural network must be scaled beyond a single machine to as many machines as possible by distributing the ...
Probabilistic Synchronous Parallel
Most machine learning and deep neural network algorithms rely on iterative procedures to optimise their utility/cost functions, e.g. Stochastic Gradient Descent (SGD). In distributed learning, the networked nodes have to work collaboratively to update the model parameters, and the way they proceed is referred to as the synchronous parallel design (or barrier control). Synchronous parall...
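To illustrate what barrier control means, the sketch below contrasts a fully synchronous barrier with a relaxed, quorum-based one that releases a training round once a fraction of the workers has reported; this threshold rule is an illustrative assumption, not the specific probabilistic condition proposed in the paper.

```python
import random

def relaxed_barrier(arrived_workers, total_workers, quorum_fraction=0.8):
    """Release the barrier once a fraction of workers has reported.

    quorum_fraction = 1.0 recovers a fully synchronous barrier (wait for everyone);
    lower values trade strict consistency for less waiting on stragglers.
    Illustrative assumption only, not the paper's barrier condition.
    """
    return len(arrived_workers) >= quorum_fraction * total_workers

# Toy example: 10 workers finish a step in random order; the barrier opens
# as soon as 8 of them have arrived, without waiting for the slowest two.
arrived = set()
for worker_id in random.sample(range(10), k=10):
    arrived.add(worker_id)
    if relaxed_barrier(arrived, total_workers=10):
        print(f"barrier released after {len(arrived)} workers")
        break
```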